library(roahd)
Functional Data Analysis is one of the core research areas of our department, you have already seen during class that some of the state-of-the-art techniques were actually developed here. Depth-based techniques find a natural application in Functional Statistics. I assume that many of you already know what we mean by functional data (applied statistics course docet), nevertheless do not worry if you do not: I have structured this tutorial as a very basic primer to understand how to do some methodological and applied work in the framework of functional data analysis (FDA).
According to the FDA model, data can be seen as measurements of a quantity/quantities along a given, independent and continuous indexing variable (such as time or space). Observations are then treated as random functions and can be viewed as trajectories of stochastic processes defined on a given infinite dimensional functional space.
A natural way to reason (and actually comprehend why they need specific and peculiar techniques) is to think about them as relatively dense longitudinal data: unlike tabular data, functional data are not invariant to permutations of their dimensions (i.e., if I switch two columns of a standard multivariate dataset no harm is done, if I do it with a functional one, I will create a bloody mess.)
Given its ubiquitous presence in applications FDA has been a very active research field in the past decades, with a plethora of R packages developed for helping practitioner efficiently and effectively employ such techniques. In what follows we will focus our attention on the roahd package (developed by members and former members of our Department of Mathematics): a package meant to collect and provide methods for the analysis of univariate and multivariate functional datasets through the use of robust methods. First off we will dedicate some time in understanding how to represent these complex infinite dimensional objects (our computers only deal with finite approximations) and how to simulate functional data. We will then focus on computation of depths and outlier detection. This lab WILL NOT provide a thorough treatment of FDA, but we will limit to study some introductory concepts on how to deal with univariate (and bivariate, only very briefly) functional datasets.
The way roahd package represents functional objects is by providing an evenly spaced grid \(I=\left[t_{0}, t_{1}, \ldots, t_{P-1}\right]\) \(\left(t_{j}-t_{j-1}=h>0, \: \forall j=1, \ldots, P-1\right)\) over which the functional observations \(D_{i, j}=X_{i}\left(t_{j}\right)\) \(\forall i=1, \ldots, N\) and \(\forall j=0, \ldots, P-1\) are measured. This is very conveniently handled by the fData object class. In particular, the following model is considered for the generation of data: \[
X(t)=m(t)+\varepsilon(t), \text { for all } t \text { in } I
\] where \(m(t)\) is the center and \(\varepsilon(t)\) is a centered gaussian process with covariance function \(C(\cdot,\cdot)\). That is to say: \[
\operatorname{Cov}(\varepsilon(\mathrm{s}), \varepsilon(t))=C(s, t), \text { with } s, t \text { in } I
\] The employed structure for \(C(s, t)\) is the Exponential covariance function:
\[ C(s, t)=\alpha e^{-\beta|s-t|} \]
P <- 101
grid <- seq( 0, 1, length.out = P)
alpha <- 0.2
beta <- 0.2
C_st <- exp_cov_function( grid, alpha, beta )
C_st contains a \(P \times P\) matrix of values.
image( C_st,
main = 'Exponential covariance function',
xlab = 'grid', ylab = 'grid')
After having defined the mean function \(m(t)\)
m <- sin(pi*grid)+sin(2*pi*grid)
we are ready to generate functional data as follows
n <- 100
set.seed(26111992)
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
The output of the previous chunk is actually a \(n \times P\) matrix, where \(P\) is equal to the length of the grid. That is, instead of having a “proper” functional datum, I have its evaluation on a relatively fine grid (this is actually the best we can do). We can plot it with:
matplot(grid,t(data), type="l", col=adjustcolor(col=1,alpha.f = .4))
lines(grid,m, col="blue", lwd=5)
Or even better we exploit the potential of the roahd package by constructing a functional object:
f_data <- fData(grid,data)
plot(f_data) # what happens if I do plot(data)?
lines(grid,m, col="black", lwd=5)
The first command defines an object of class fdata
class(f_data)
## [1] "fData"
with dedicated methods for plots, the four basic algebraic operations, subsetting and much more. In particular, using objects of class fdata makes the computation of depth measures and related quantities much easier (see next section). For a thorough account on the potential of the roahd package, check this paper.
By changing the hyperparameters \(\alpha\) and \(\beta\) in the Exponential covariance function, we can generate functional datum with different degree of dependence and variability. In details, \(\alpha\) controls the overall level of variability in the signal, while the parameter \(\beta\) affects the autocorrelation length of the signal’s noise, with lower values of \(\beta\) leading to wider correlation lengths and vice-versa
alpha <- 1
beta <- 0.2
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
plot(f_data, main="High overall level of variability")
alpha <- .1
beta <- 0.0001
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
plot(f_data, main="High smothness")
alpha <- .1
beta <- 100
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
plot(f_data, main="Virtually uncorrelated signals")
Let us now consider the data you have seen in the Case Study: ECG signals. The registered and smoothed signals are contained in the mfD_healthy object.
data("mfD_healthy") #
univariate_fdata <- mfD_healthy$fDList[[1]] # I consider the first lead only
plot(univariate_fdata)
With roadhd it is very easy to compute Band depths and Modified band depths for a given fdata object:
band_depth <- BD(Data = univariate_fdata)
modified_band_depth <- MBD(Data = univariate_fdata)
We can compute the median curve
median_curve <- median_fData(fData = univariate_fdata, type = "MBD") # still an fData object
Or manually by firstly computing the Modified Band Depth and then identifying the curve with max MBD
median_curve_manual <- univariate_fdata[which.max(modified_band_depth),] # still an fData object
all(median_curve_manual$values==median_curve$values)
## [1] TRUE
plot(univariate_fdata)
grid_ecg <- seq(median_curve_manual$t0,median_curve_manual$tP,by=median_curve_manual$h)
lines(grid_ecg,median_curve_manual$values)
Epigraph Index (EI) and Hypograph Index (HI) or their corresponding Modified versions (MEI and MHI) for providing down-upward/up-downward order of data can also be computed: see functions EI, HI, MEI, MHI. As an example, let us compute the Spearman’s correlation index between the first and second lead of the ECG signals. Recall that for a bivariate functional dataset \([\mathbf{x} \mathbf{y}]\)
\[ \hat{\rho}_{s}(\mathbf{x}, \mathbf{y})=\hat{\rho}_{p}\left(I L_{n}-\operatorname{grade}(\mathbf{x}), I L_{n}-\operatorname{grade}(\mathbf{y})\right) \] where \(\hat{\rho}_{p}\) is the sample Pearson correlation coefficient and \[ \begin{aligned} &I L_{n}-\operatorname{grade}(\mathbf{x})=\left(I L_{n}-\operatorname{grade}\left(x_{1}\right), I L_{n}-\operatorname{grade}\left(x_{2}\right), \ldots, I L_{n}-\operatorname{grade}\left(x_{n}\right)\right) \\ &I L_{n}-\operatorname{grade}(\mathbf{y})=\left(I L_{n}-\operatorname{grade}\left(y_{1}\right), I L_{n}-\operatorname{grade}\left(y_{2}\right), \ldots, I L_{n}-\operatorname{grade}\left(y_{n}\right)\right) . \end{aligned} \] where the inferior length grade \(I L_{n}-\operatorname{grade}(x_i)\) can be interpreted as the “proportion of time” that the sample \(x_1,\ldots,x_n\) is smaller than \(x_i\), that is it defines the relative position of a curve with respect to the sample.
The Spearman’s correlation index is immediately obtained via the cor_spearman function
bivariate_data <- as.mfData(list(mfD_healthy$fDList[[1]], mfD_healthy$fDList[[2]]))
plot(bivariate_data)
cor_spearman(bivariate_data, ordering='MEI')
## [1] 0.6840466
Or actually we can manually compute it:
MEI_first_lead <- MEI(bivariate_data$fDList[[1]])
MEI_second_lead <- MEI(bivariate_data$fDList[[2]])
cor(MEI_first_lead, MEI_second_lead)
## [1] 0.6840466
Coding tip: never forget the power of functional programming!
do.call(args = lapply(1:2, function(ind)
MEI(bivariate_data$fDList[[ind]])), what = "cor")
## [1] 0.6840466
We conclude this lab by looking at some graphical tools for identyfing outliers in a functional sample. Consider the following univariate functional data
alpha <- 0.2
beta <- 0.002
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
We consider two cases: first a dataset with some magnitude outliers, obtained inflating the last 10 curves by a number generated from a uniform distribution in \([2,3]\)
outlier_share <- .1
n_outliers <- n*outlier_share
out_highlighter <- rep(c(1,2),c(n-n_outliers,n_outliers))
f_data_temp <- f_data[1:(n*(1-outlier_share)),] # Coding tip: subsetting is mabe possible by the S3 class fdata
mag_temp <- f_data[(n*(1-outlier_share)+1):n,] * runif(10,2,3)
f_data_mag <- append_fData(f_data_temp,mag_temp)
plot(f_data_mag, col=out_highlighter)
And then some shape outliers by shifting the generating mechanism by the quantity shift_q:
shift_q <- .5
mu_warp=mu=sin(pi*grid+shift_q)+sin(2*pi*grid+shift_q)
shape_temp=generate_gauss_fdata(N = n_outliers, mu_warp, Cov=C_st)
shape_temp=fData(grid,shape_temp)
f_data_shape=append_fData(f_data_temp,shape_temp)
plot(f_data_shape, col=out_highlighter)
Now let us see whether we can employ the two plotting tools we have seen in class, namely functional boxplots and outliegrams to detect magnitude and shape outliers, respectively.
invisible(fbplot(f_data_mag, main="Magnitude outliers"))
invisible(outliergram(f_data_mag))
I used invisible() in the previous chunk to avoid the output of the call to be printed on the console. If I do not use it I obtain the following:
fbplot(f_data_shape, main="Shape outliers")
## $Depth
## [1] 0.11026103 0.46750475 0.31830583 0.50674867 0.51170917 0.41661966
## [7] 0.04915492 0.45632763 0.33511551 0.41957596 0.49002300 0.24140614
## [13] 0.45573557 0.51175718 0.40603860 0.04951495 0.17542754 0.14925693
## [19] 0.51265327 0.34617262 0.40875088 0.25050305 0.48599460 0.51052105
## [25] 0.36885889 0.31769377 0.18042004 0.20834683 0.17305531 0.09131113
## [31] 0.45937594 0.45148715 0.27263726 0.38583658 0.44261026 0.48627863
## [37] 0.50184018 0.21235524 0.39568957 0.12996300 0.30741674 0.51004900
## [43] 0.48460246 0.43974197 0.23552155 0.16881888 0.33869187 0.50578858
## [49] 0.50873287 0.47826583 0.49369937 0.21303930 0.50945295 0.51153715
## [55] 0.42693269 0.47421342 0.02000000 0.31366937 0.30642464 0.47229723
## [61] 0.06980498 0.51137314 0.09376738 0.38274027 0.49405541 0.46875688
## [67] 0.45996000 0.35758976 0.26651265 0.49105911 0.31920592 0.36915492
## [73] 0.50125613 0.13575958 0.49304330 0.42556856 0.50692469 0.35768577
## [79] 0.48397840 0.25093509 0.37405141 0.11496550 0.43172517 0.02425643
## [85] 0.50404040 0.41123512 0.17479548 0.27271727 0.28314631 0.07165717
## [91] 0.19714971 0.37210321 0.38486849 0.35592959 0.36129413 0.32284628
## [97] 0.31031703 0.35520552 0.23547755 0.28801880
##
## $Fvalue
## [1] 1.5
##
## $ID_outliers
## integer(0)
outliergram(f_data_shape)
## $Fvalue
## [1] 1.5
##
## $d
## [1] 5.835831e-04 3.498964e-03 1.497219e-03 2.782456e-03 2.978912e-03
## [6] 2.311201e-03 2.945245e-04 1.808656e-03 8.756519e-04 1.868108e-03
## [11] 3.700964e-03 2.404359e-03 3.269872e-03 3.198142e-03 2.046700e-03
## [16] 1.265077e-04 7.141704e-04 8.440844e-04 2.009468e-03 2.218638e-03
## [21] 2.259077e-03 1.164433e-03 2.380159e-03 1.302388e-03 1.074286e-03
## [26] 3.226818e-03 1.269275e-03 1.483039e-03 9.531646e-04 5.276171e-04
## [31] 3.232640e-03 1.153145e-03 1.355819e-03 2.149601e-03 2.506270e-03
## [36] 3.990062e-03 1.915875e-03 1.040223e-03 2.225609e-03 6.732158e-04
## [41] 2.821391e-03 3.279734e-03 3.171921e-03 2.442700e-03 1.059512e-03
## [46] 7.376183e-04 1.325162e-03 2.188536e-03 3.090566e-03 1.992080e-03
## [51] 4.006024e-03 9.741172e-04 2.542591e-03 2.159899e-03 2.346254e-03
## [56] 2.053948e-03 -4.267420e-16 8.760876e-04 1.130212e-03 3.022758e-03
## [61] 3.414203e-04 3.147998e-03 2.655711e-04 1.609191e-03 1.357878e-03
## [66] 2.600339e-03 2.002378e-03 1.536629e-03 1.362512e-03 1.457651e-03
## [71] 2.086466e-03 1.741441e-03 3.434680e-03 4.401232e-04 2.527223e-03
## [76] 1.637906e-03 3.857415e-03 1.884783e-03 1.762434e-03 1.454046e-03
## [81] 2.137560e-03 3.399548e-04 1.704883e-03 9.042488e-05 2.082862e-03
## [86] 1.582138e-03 5.265477e-04 8.611158e-04 1.638778e-03 1.771266e-04
## [91] 1.012037e-01 1.019168e-01 1.229187e-01 1.406858e-01 9.764470e-02
## [96] 1.345493e-01 7.270656e-02 1.412952e-01 1.190596e-01 1.401026e-01
##
## $ID_outliers
## [1] 91 92 93 94 95 96 97 98 99 100
We appreciate how the shape outliers are for the most part missed by the functional boxplot, whereas the outliergram effectively uncovers them.
As usual, saving the output of the plots to an object allows to recover the ID of the identified outliers:
out_shape <- outliergram(f_data_shape, display = FALSE)
out_shape$ID_outliers
## [1] 91 92 93 94 95 96 97 98 99 100
One last comment: we have seen in class that the value \(F\) used to build the fences in the functional boxplot can be adjusted. Despite the algorithmic implementation/ mathematical theory behind it being quite tedious, the automated selection of \(F\) via the roahd package is actually pretty easy:
set.seed(22)
fbplot(f_data_mag, main="Magnitude outliers",adjust = list(VERBOSE=T))
## * * * Iteration 1 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 2 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 3 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 4 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 5 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 6 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 7 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 8 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 9 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 10 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 11 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 12 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 13 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 14 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 15 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 16 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 17 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 18 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 19 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## * * * Iteration 20 / 20
## * * * * beginning optimization
## * * * * optimization finished.
## $Depth
## [1] 0.17503950 0.48312231 0.35960596 0.51200120 0.51364936 0.44245425
## [7] 0.11152115 0.44876688 0.37394339 0.41193519 0.50070807 0.29320732
## [13] 0.47349735 0.51090909 0.39948595 0.07966197 0.19378938 0.21105111
## [19] 0.51260926 0.38394839 0.43572557 0.30023202 0.49729573 0.50594859
## [25] 0.36475048 0.35919392 0.23865787 0.26359636 0.23252925 0.15602560
## [31] 0.45032703 0.44291029 0.27628163 0.41656766 0.46284028 0.49841184
## [37] 0.50915292 0.26825683 0.42468047 0.19441744 0.35050905 0.51350535
## [43] 0.49743174 0.46089609 0.24200220 0.18662466 0.33632363 0.51172517
## [49] 0.50543254 0.47024102 0.48763476 0.22399640 0.50591659 0.51408941
## [55] 0.45076308 0.46616062 0.05079908 0.31275728 0.30684468 0.48818282
## [61] 0.13441144 0.51074507 0.11877788 0.37915592 0.48681068 0.46055606
## [67] 0.47754975 0.35506551 0.31411341 0.48376638 0.31766977 0.36491449
## [73] 0.50842884 0.15612561 0.48632663 0.41923592 0.50232423 0.39438944
## [79] 0.49541554 0.25597560 0.40713071 0.13696770 0.42526453 0.08418242
## [85] 0.49793979 0.40624262 0.19225323 0.27624162 0.32899090 0.09945195
## [91] 0.15578958 0.22318432 0.18999300 0.22654065 0.06500850 0.20206221
## [97] 0.15660166 0.18280828 0.18657666 0.03113311
##
## $Fvalue
## [1] 1.547542
##
## $ID_outliers
## [1] 91 92 93 94 95 96 97 98 99 100